THE ASYMPTOTIC DISTRIBUTION OF A CLUSTER-INDEX FOR
I.I.D. NORMAL RANDOM VARIABLES

By Yannis G. Yatracos 

National University of Singapore 

In a sample variance decomposition, with components functions
of the sample's spacings, the largest component $I_n$ is used in
cluster detection. It is shown for normal samples that the asymptotic
distribution of $I_n$ is the Gumbel distribution.

1. Introduction. Clusters are nowadays data structures of considerable
interest: microarray data are used to assign genes to clusters; gene
expression is used to cluster tumors and identify similar types of cancer.
Extreme value theory, in particular of sample spacings, has been used
extensively in modeling phenomena. The extreme value $I_n$ of functions of
spacings is introduced in Yatracos (2007) to detect data clusters from
their one-dimensional data projections; $n$ is the size of the data. In
this work, the asymptotic distribution of $I_n$ is obtained for data from
the normal distribution, and can be used to determine the statistical
significance of potential clusters.

Consider a sequence $X_1, \ldots, X_n$ of independent identically distributed
random variables with cumulative distribution function $F$. Let $X_{(i)}$ be
the $i$th order statistic, $i = 1, \ldots, n$. Define the spacing

$$S_i = X_{(i+1)} - X_{(i)}, \quad i = 1, \ldots, n-1,$$

the maximum spacing

$$M_n = \max\{S_i, i = 1, \ldots, n-1\} = M_n^{(1)}$$

and the $k$th largest spacing $M_n^{(k)}$, $k = 1, \ldots, n-1$.

The large sample behavior of $M_n$ and $M_n^{(k)}$, that is, their asymptotic
distribution, large deviation properties and a.s. behavior, has been studied
for various choices of $F$ by Pyke (1965), Slud (1977/78), Devroye (1981,
1982, 1984), Deheuvels (1982, 1983, 1984, 1985) and other authors.
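The definitions of $S_i$, $M_n$ and $M_n^{(k)}$ can be illustrated with a minimal sketch. The function names `spacings` and `kth_largest_spacing` are hypothetical and chosen only for this illustration.

```python
def spacings(sample):
    """Return S_i = X_(i+1) - X_(i), i = 1, ..., n-1, from the sorted sample."""
    xs = sorted(sample)
    return [b - a for a, b in zip(xs, xs[1:])]


def kth_largest_spacing(sample, k=1):
    """Return M_n^(k), the k-th largest spacing; k = 1 gives M_n."""
    ordered = sorted(spacings(sample), reverse=True)
    return ordered[k - 1]
```

For the sample {3.0, 1.0, 2.5} the order statistics are 1.0, 2.5, 3.0, so the spacings are 1.5 and 0.5.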
When $F = \Phi$, the cumulative distribution function of a standard normal
$N(0,1)$ random variable, it is shown herein that the asymptotic distribution
of

$$I_n = \max\{S_iT_i, i = 1, \ldots, n-1\}$$

is a standard Gumbel distribution;

$$T_i = \frac{i(n-i)}{n^2}(\bar X_{[i+1,n]} - \bar X_{[1,i]}),$$

$\bar X_{[i,j]}$ is the average of the order statistics from $i$ to $j$, $i < j$.

Results in Deheuvels (1985) are crucial to obtain the result.

$S_iT_i$ is the $i$th component in a standardized sample variance
decomposition, $i = 1, \ldots, n-1$ [Yatracos (1998)]. The largest component
in the decomposition, $I_n$, determines two least homogeneous sample
clusters. For multivariate data, $I_n$ is used to determine two clusters with
the least homogeneous one-dimensional data projection [Yatracos (2007)].
Significance with respect to the normal model is justified since, for many
high dimensional data sets, to find unusual projections one should search
for nonnormality [Diaconis and Freedman (1984)].

2. The sample variance decomposition and $I_n$. Univariate observations
$X_1, \ldots, X_n$ are usually separated in two clusters by comparing the
standardized difference of the group averages $\bar X_{[1,i]}$ and
$\bar X_{[i+1,n]}$, respectively, of the $i$ smaller observations
$X_{(1)}, \ldots, X_{(i)}$ and of the $n-i$ larger observations
$X_{(i+1)}, \ldots, X_{(n)}$, for $i = 1, \ldots, n-1$. The spacing $S_i$
between the two groups may vary and it is natural to use it in a
dissimilarity measure. The product
$\frac{i(n-i)}{n^2}(\bar X_{[i+1,n]} - \bar X_{[1,i]})S_i$ is related to the
sample variance [Yatracos (1998)]

$$\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar X)^2 = \sum_{i=1}^{n-1}\frac{i(n-i)}{n^2}(\bar X_{[i+1,n]} - \bar X_{[1,i]})S_i,$$

and measures between-groups variation. The standardized variance
components

$$S_iT_i = \frac{i(n-i)}{n^2}(\bar X_{[i+1,n]} - \bar X_{[1,i]})S_i$$

indicate the relative contribution of the groups $X_{(1)}, \ldots, X_{(i)}$
and $X_{(i+1)}, \ldots, X_{(n)}$ in the sample variability.
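The decomposition can be checked numerically. The sketch below assumes the components are $\frac{i(n-i)}{n^2}(\bar X_{[i+1,n]} - \bar X_{[1,i]})S_i$ and that the sample variance uses divisor $n$; the function names are hypothetical.

```python
def variance_components(sample):
    """Return the n-1 components i(n-i)/n^2 * (upper mean - lower mean) * S_i."""
    xs = sorted(sample)
    n = len(xs)
    total = sum(xs)
    lower_sum = 0.0
    comps = []
    for i in range(1, n):
        lower_sum += xs[i - 1]                 # sum of the i smallest values
        mean_low = lower_sum / i
        mean_up = (total - lower_sum) / (n - i)
        s_i = xs[i] - xs[i - 1]                # spacing S_i
        comps.append(i * (n - i) / n**2 * (mean_up - mean_low) * s_i)
    return comps


def sample_variance(sample):
    """Sample variance with divisor n."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / n
```

For the sample {0, 1, 3} the two components are 4/9 and 10/9, and their sum 14/9 equals the sample variance.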
The statistic

$$I_n = \max\{S_iT_i, i = 1, \ldots, n-1\}$$

determines two potential clusters. When $I_n = S_jT_j$, these clusters are
$C_1 = \{X_{(1)}, \ldots, X_{(j)}\}$, $C_2 = \{X_{(j+1)}, \ldots, X_{(n)}\}$
and the cluster separators are $s_1 = X_{(j)}$, $s_2 = X_{(j+1)}$.
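The way the largest component determines two clusters can be sketched as follows, assuming $T_i = \frac{i(n-i)}{n^2}(\bar X_{[i+1,n]} - \bar X_{[1,i]})$. The function name `split_by_largest_component` is hypothetical, and the sample needs at least two observations.

```python
def split_by_largest_component(sample):
    """Return (I_n, C_1, C_2, (s_1, s_2)) for a univariate sample."""
    xs = sorted(sample)
    n = len(xs)
    total = sum(xs)
    lower_sum = 0.0
    best_j, best_val = 1, float("-inf")
    for i in range(1, n):
        lower_sum += xs[i - 1]
        t_i = i * (n - i) / n**2 * ((total - lower_sum) / (n - i) - lower_sum / i)
        val = (xs[i] - xs[i - 1]) * t_i        # component S_i * T_i
        if val > best_val:
            best_j, best_val = i, val
    # Clusters are the j smallest and the n-j largest observations.
    return best_val, xs[:best_j], xs[best_j:], (xs[best_j - 1], xs[best_j])
```

For the sample {0.0, 0.1, 0.2, 5.0, 5.1} the largest component occurs at $j = 3$, which separates the three small values from the two large ones.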
3. The asymptotic distribution of $I_n$.

Theorem 3.1. Let $Z_1, \ldots, Z_n$ be i.i.d. standard normal random
variables, $x \in R$. Then it holds that

(1) $$\lim_{n \to +\infty} P[nI_n \le x + \log n] = \exp\{-\exp\{-x\}\}.$$
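A small Monte Carlo sketch of the statement in (1) is given below. It assumes $I_n = \max_i S_iT_i$ with $T_i = \frac{i(n-i)}{n^2}(\bar Z_{[i+1,n]} - \bar Z_{[1,i]})$. The function names are hypothetical, and since the convergence in extreme value limits is typically slow, the empirical and limiting values should not be expected to agree closely for moderate $n$.

```python
import math
import random


def n_times_index(sample):
    """Return n * I_n for the given sample."""
    xs = sorted(sample)
    n = len(xs)
    total = sum(xs)
    lower_sum = 0.0
    best = float("-inf")
    for i in range(1, n):
        lower_sum += xs[i - 1]
        t_i = i * (n - i) / n**2 * ((total - lower_sum) / (n - i) - lower_sum / i)
        best = max(best, (xs[i] - xs[i - 1]) * t_i)
    return n * best


def gumbel_cdf(x):
    """Standard Gumbel distribution function exp{-exp{-x}}."""
    return math.exp(-math.exp(-x))


def empirical_probability(x, n, reps, rng):
    """Monte Carlo estimate of P[n I_n <= x + log n] for standard normal data."""
    hits = 0
    for _ in range(reps):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if n_times_index(z) <= x + math.log(n):
            hits += 1
    return hits / reps
```

A call such as `empirical_probability(0.5, 2000, 500, random.Random(1))` can then be compared with `gumbel_cdf(0.5)`.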

The proof of Theorem 3.1 is based on the four lemmas that follow. It is
enough to obtain the asymptotic distribution of

$$\Big\{\frac{i(n-i)}{n^2}(\bar Z_{[i+1,n]} - \bar Z_{[1,i]})(Z_{(i+1)} - Z_{(i)}), \ i = 1, \ldots, n-1\Big\}.$$

Let $Z_{(i)}$ be the $i$th order statistic with density $g_i$,
$i = 1, \ldots, n$. Let $n_+$ (resp. $n_-$) be the number of positive (resp.
negative) observations. Then it holds that $n_+ \sim n_- \sim \frac{n}{2}$
a.s.; $a_n \sim b_n$ denotes $\lim_{n \to \infty} a_n/b_n = 1$. Due to the
symmetry of the normal distribution, without loss of generality, the lemmas
are proved for the positive observations
$0 < Z_{(n/2+l_n)}, \ldots, Z_{(n)}$, $Z_{(n/2+l_n-1)} < 0$; $l_n = o(n)$
can take either positive or negative values. One may think of the arguments
as conditional on the value of $n_+$.

Lemma 3.1. For $\epsilon > 0$ and $i = \frac{n}{2} + l_n, \ldots, n$, it holds that

(2) $$P[Z_{(i)}(Z_{(i+1)} - Z_{(i)}) > \epsilon] \le (1 - \epsilon e^{-1.5\epsilon})^{n-i}.$$

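Proof. Recall that for any $x > 0$ it holds

(3) $$\Phi\Big(x + \frac{\epsilon}{x}\Big) - \Phi(x) \ge \frac{\epsilon\,\phi(x + \epsilon/x)}{x},$$

(4) $$\frac{x}{1 + x^2}\phi(x) < 1 - \Phi(x) < \frac{\phi(x)}{x}$$

[Chow and Teicher (1988), page 49], and thus

(5) $$\frac{\Phi(x + \epsilon/x) - \Phi(x)}{1 - \Phi(x)} \ge \frac{\epsilon\,\phi(x + \epsilon/x)}{\phi(x)} = \epsilon e^{-0.5\epsilon^2/x^2 - \epsilon}.$$

The Markovian property of $Z_{(1)}, \ldots, Z_{(n)}$ implies that given
$Z_{(i)} = z$, the r.v.'s $Z_{(i+1)}, \ldots, Z_{(n)}$ form a sample from a
standard normal distribution truncated at $z$ and, therefore,

(6) $$P[Z_{(i)}(Z_{(i+1)} - Z_{(i)}) > \epsilon] = E\,P(Z_{(i)}(Z_{(i+1)} - Z_{(i)}) > \epsilon \mid Z_{(i)} = x) = \int \Big[\frac{1 - \Phi(x + \epsilon/x)}{1 - \Phi(x)}\Big]^{n-i} g_i(x)\,dx.$$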
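For $0 < x < \sqrt{\epsilon}$,

$$\frac{d}{dx}\Big[\frac{1 - \Phi(x + \epsilon/x)}{1 - \Phi(x)}\Big] = \frac{-\phi(x + \epsilon/x)(1 - \epsilon/x^2)(1 - \Phi(x)) + \phi(x)(1 - \Phi(x + \epsilon/x))}{(1 - \Phi(x))^2} > 0,$$

thus,

(7) $$\frac{1 - \Phi(x + \epsilon/x)}{1 - \Phi(x)} \le \frac{1 - \Phi(2\sqrt{\epsilon})}{1 - \Phi(\sqrt{\epsilon})} = 1 - \frac{\Phi(2\sqrt{\epsilon}) - \Phi(\sqrt{\epsilon})}{1 - \Phi(\sqrt{\epsilon})} \le 1 - \epsilon e^{-1.5\epsilon}.$$

The last inequality follows from (5) with $x = \sqrt{\epsilon}$.
For $x \ge \sqrt{\epsilon}$ it holds that

(8) $$e^{-0.5\epsilon^2/x^2} \ge e^{-0.5\epsilon}.$$

From (5) and (8), it follows that

(9) $$\frac{1 - \Phi(x + \epsilon/x)}{1 - \Phi(x)} = 1 - \frac{\Phi(x + \epsilon/x) - \Phi(x)}{1 - \Phi(x)} \le 1 - \epsilon e^{-0.5\epsilon^2/x^2 - \epsilon} \le 1 - \epsilon e^{-1.5\epsilon}.$$

Inequality (2) follows from (6), (7) and (9). □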
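The two ingredients of this proof, the Mills ratio bounds in (4) and the bound $\frac{1 - \Phi(x + \epsilon/x)}{1 - \Phi(x)} \le 1 - \epsilon e^{-1.5\epsilon}$ for $x > 0$, can be checked on a grid. The sketch below is only a numerical sanity check; the function names are hypothetical and the normal tail is computed with `math.erfc`.

```python
import math


def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def upper_tail(x):
    """1 - Phi(x), computed via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def mills_bounds_hold(x):
    """Check x/(1+x^2) * phi(x) <= 1 - Phi(x) <= phi(x)/x for x > 0."""
    return x / (1.0 + x * x) * phi(x) <= upper_tail(x) <= phi(x) / x


def survival_ratio(x, eps):
    """(1 - Phi(x + eps/x)) / (1 - Phi(x)) for x > 0."""
    return upper_tail(x + eps / x) / upper_tail(x)


def lemma_bound(eps):
    """The bound 1 - eps * exp(-1.5 * eps)."""
    return 1.0 - eps * math.exp(-1.5 * eps)
```

Raising `survival_ratio(x, eps)` to the power $n-i$ and integrating against $g_i$ is what yields the right-hand side of (2).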



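Lemma 3.2. Let

$$R_i = \frac{\phi(Z_{(i)})}{\phi(Z_{(i)} + T_i(Z_{(i+1)} - Z_{(i)}))},$$

$T_i$ is a random variable whose existence is guaranteed by Taylor's
theorem, $0 < T_i < 1$, $i = \frac{n}{2} + l_n, \ldots, n-1$,
$k_n \sim (\log n)^{1+\zeta}$, $0 < \zeta < 2$, $m_n \to +\infty$ as
$n \to +\infty$. Then for small $\epsilon > 0$ as $n \to +\infty$, it holds that

(10) $$P\Big[\sup\Big\{R_i, i = \frac{n}{2} + l_n, \ldots, n - k_n\Big\} > 1 + \epsilon\Big] \to 0,$$

(11) $$P[\sup\{R_i, i = n - k_n + 1, \ldots, n-1\} > m_n] \to 0.$$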
Proof. For $i = \frac{n}{2} + l_n, \ldots, n-1$, it holds that

(12) $$1 \le R_i \le e^{0.5(Z_{(i+1)} - Z_{(i)})^2 + Z_{(i)}(Z_{(i+1)} - Z_{(i)})}.$$

In Deheuvels (1985), it is shown that for $\eta > 0$,

$$P[\sqrt{2\log n}\max\{Z_{(i+1)} - Z_{(i)}, i = 1, \ldots, n-1\} > (1 + \eta)\log\log n \ \text{i.o.}] = 0;$$

i.o. denotes "infinitely often." Thus, from (12), to prove (10) and (11), it is
enough to prove respectively that as $n \to +\infty$,

$$P\Big[\sup\Big\{Z_{(i)}(Z_{(i+1)} - Z_{(i)}), i = \frac{n}{2} + l_n, \ldots, n - k_n\Big\} > \epsilon\Big] \to 0,$$

$$P[\sup\{Z_{(i)}(Z_{(i+1)} - Z_{(i)}), i = n - k_n + 1, \ldots, n-1\} > \log m_n] \to 0.$$
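From (2), it follows that

$$P\Big[\sup\Big\{Z_{(i)}(Z_{(i+1)} - Z_{(i)}), i = \frac{n}{2} + l_n, \ldots, n - k_n\Big\} > \epsilon\Big] \le \Big(\frac{n}{2} - k_n - l_n + 1\Big)(1 - \epsilon e^{-1.5\epsilon})^{k_n} \to 0$$

as $n \to +\infty$ since $l_n = o(n)$.^1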

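For $\theta > 0$, $Z_{(n)} \le (1 + \theta)\sqrt{2\log n}$ a.s. for
$n \ge n(\theta)$ and

$$P[\sup\{Z_{(i)}(Z_{(i+1)} - Z_{(i)}), i = n - k_n + 1, \ldots, n-1\} > \log m_n]$$
$$\le P[(1 + \theta)\sqrt{2\log n} \times \sup\{(Z_{(i+1)} - Z_{(i)}), i = n - k_n + 1, \ldots, n-1\} > \log m_n].$$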

From Lemma 6 in Deheuvels (1985), the Kn = [(logn)'^] largest order 
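statistics generate spacings which are uniformly close to
$(2\log n)^{-1/2}E_j/j$, $j = 1, \ldots, K_n$, where
$\{E_j, j = 1, \ldots, K_n\}$ are i.i.d. exponential r.v.'s with mean 1.
Thus, it holds that

$$P\Big[\sqrt{2\log n}(Z_{(n-j+1)} - Z_{(n-j)}) > \frac{\log m_n}{1 + \theta}\Big] \sim P\Big[E_j > \frac{j\log m_n}{1 + \theta}\Big] = e^{-j\log m_n/(1+\theta)}, \quad j = 1, \ldots, K_n,$$

and

$$P\Big[\sqrt{2\log n}\sup\{(Z_{(i+1)} - Z_{(i)}), i = n - k_n + 1, \ldots, n-1\} > \frac{\log m_n}{1 + \theta}\Big] \le \sum_{j=1}^{k_n - 1} e^{-j\log m_n/(1+\theta)}$$
$$= e^{-\log m_n/(1+\theta)}\,\frac{1 - e^{-(k_n - 1)\log m_n/(1+\theta)}}{1 - e^{-\log m_n/(1+\theta)}} \to 0$$

as $n \to +\infty$. □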



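Lemma 3.3. For any real $x$, as $n \to +\infty$,

$$P\Big[\bigcap_{i=n/2+l_n}^{n-1}\{(n-i)(Z_{(i+1)} - Z_{(i)})E(\bar Z_{[i+1,n]} \mid Z_{(i)}) \le x + \log n\}\Big] \to e^{-0.5e^{-x}}.$$

Proof. Let $k_n \sim (\log n)^{1+\zeta}$, $0 < \zeta < 2$, $m_n \sim \log\log n$,

(13) $$A_i = \{(n-i)(Z_{(i+1)} - Z_{(i)})E(\bar Z_{[i+1,n]} \mid Z_{(i)}) \le x + \log n\}$$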



^The result follows also from Deheuvels (1985) with fc„ = [(logn)^]. 
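and $A_i'$ its complement, $i = \frac{n}{2} + l_n, \ldots, n-1$. Then

$$P\Big[\bigcap_{i=n/2+l_n}^{n-1} A_i\Big] = P\Big[\bigcap_{i=n/2+l_n}^{n-k_n} A_i\Big] - P\Big[\bigcap_{i=n/2+l_n}^{n-k_n} A_i \cap \bigcup_{i=n-k_n+1}^{n-1} A_i'\Big]$$

and it is enough to show that as $n \to +\infty$

$$P\Big[\bigcap_{i=n/2+l_n}^{n-k_n} A_i\Big] \to e^{-0.5e^{-x}}$$

and

$$P\Big[\bigcup_{i=n-k_n+1}^{n-1} A_i'\Big] \to 0.$$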

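Let $U_{(i)}$ be the $i$th order statistic from $n$ i.i.d. uniform r.v.'s on
$(0, 1)$. Then it holds that

(14) $$Z_{(i+1)} - Z_{(i)} = \Phi^{-1}(U_{(i+1)}) - \Phi^{-1}(U_{(i)}) = \frac{U_{(i+1)} - U_{(i)}}{\phi(\tilde Z_i)}$$

with $\tilde Z_i = Z_{(i)} + T_i(Z_{(i+1)} - Z_{(i)})$, $T_i$ is a random
variable whose existence is guaranteed by Taylor's theorem, $0 < T_i < 1$.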

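Given $Z_{(i)} = z$, the r.v.'s $Z_{(i+1)}, \ldots, Z_{(n)}$ form a sample
from a standard normal distribution truncated at $z$ with mean
$\frac{\phi(z)}{1 - \Phi(z)}$ and variance

$$1 + \frac{z\phi(z)}{1 - \Phi(z)} - \Big[\frac{\phi(z)}{1 - \Phi(z)}\Big]^2.$$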
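The mean $\frac{\phi(z)}{1 - \Phi(z)}$ and the variance $1 + \frac{z\phi(z)}{1 - \Phi(z)} - [\frac{\phi(z)}{1 - \Phi(z)}]^2$ of a standard normal distribution truncated below at $z$ can be compared with moments obtained by numerical integration. The sketch below uses Simpson's rule on $[z, z + 12]$, an interval chosen here as an assumption that the remaining tail is negligible; the function names are hypothetical.

```python
import math


def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def upper_tail(x):
    """1 - Phi(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def truncated_moments_formula(z):
    """Closed-form mean and variance of N(0,1) truncated below at z."""
    hazard = phi(z) / upper_tail(z)
    return hazard, 1.0 + z * hazard - hazard * hazard


def truncated_moments_numeric(z, width=12.0, steps=20000):
    """Mean and variance by Simpson's rule on [z, z + width]; steps must be even."""
    h = width / steps
    m0 = m1 = m2 = 0.0
    for k in range(steps + 1):
        x = z + k * h
        weight = 1.0 if k in (0, steps) else (4.0 if k % 2 else 2.0)
        f = weight * phi(x)
        m0 += f
        m1 += f * x
        m2 += f * x * x
    mean = m1 / m0          # the common factor h/3 cancels in the ratios
    return mean, m2 / m0 - mean * mean
```

At $z = 0$ the formulas reduce to the half-normal values $\sqrt{2/\pi}$ and $1 - 2/\pi$.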



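Thus, it holds that

$$E(\bar Z_{[i+1,n]} \mid Z_{(i)}) = \frac{\phi(Z_{(i)})}{1 - \Phi(Z_{(i)})}$$

and that

$$(n-i)(Z_{(i+1)} - Z_{(i)})E(\bar Z_{[i+1,n]} \mid Z_{(i)}) = (n-i)\,\frac{\phi(Z_{(i)})}{\phi(\tilde Z_i)}\,\frac{U_{(i+1)} - U_{(i)}}{1 - \Phi(Z_{(i)})} = (n-i)R_i\Big(1 - \frac{V_{(n-i)}}{V_{(n-i+1)}}\Big);$$

$R_i$ is defined as in Lemma 3.2, with
$\tilde Z_i = Z_{(i)} + T_i(Z_{(i+1)} - Z_{(i)})$ as in (14),
$V_{(n-i)} = 1 - U_{(i+1)}$ is the $(n-i)$th order statistic from $n$ i.i.d.
uniform r.v.'s on $(0, 1)$, and
$(V_{(n-i)}/V_{(n-i+1)})^{n-i}$, $i = 1, \ldots, n-1$,
are i.i.d. uniform random variables on $(0,1)$ (see, e.g., David and Nagaraja
(2003)).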
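The representation used here is the classical fact that, for uniform order statistics $V_{(1)} \le \cdots \le V_{(n)}$ on $(0,1)$, the variables $(V_{(j)}/V_{(j+1)})^{j}$, $j = 1, \ldots, n-1$, are i.i.d. uniform on $(0,1)$. A minimal simulation sketch is given below; the function name `power_ratios` is hypothetical.

```python
import random


def power_ratios(n, rng):
    """Return [(V_(j) / V_(j+1))**j, j = 1, ..., n-1] for n uniform draws."""
    v = sorted(rng.random() for _ in range(n))
    # v[j-1] is V_(j) and v[j] is V_(j+1)
    return [(v[j - 1] / v[j]) ** j for j in range(1, n)]
```

Pooling the ratios over many simulated samples, their empirical mean should be close to 1/2 and the fraction below 1/4 close to 1/4.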

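From Lemma 3.2, using $D_n = x + \log n$, it holds that

$$P\Big[\bigcap_{i=n/2+l_n}^{n-k_n} A_i\Big] \sim P\Big[\bigcap_{i=n/2+l_n}^{n-k_n}\Big\{(n-i)\Big(1 - \frac{V_{(n-i)}}{V_{(n-i+1)}}\Big) \le D_n\Big\}\Big] = \prod_{i=n/2+l_n}^{n-k_n}\Big[1 - \Big(1 - \frac{D_n}{n-i}\Big)^{n-i}\Big]$$
$$\sim \prod_{i=n/2+l_n}^{n-k_n}(1 - e^{-D_n}) = (1 - e^{-\log n - x})^{n/2 - k_n - l_n + 1} \to e^{-0.5e^{-x}}$$

since $l_n = o(n)$.

From (11), it also holds that

$$P\Big[\bigcup_{i=n-k_n+1}^{n-1} A_i'\Big] \le \sum_{i=n-k_n+1}^{n-1} P\Big[(n-i)m_n\Big(1 - \frac{V_{(n-i)}}{V_{(n-i+1)}}\Big) > D_n\Big] + o(1)$$
$$= \sum_{i=n-k_n+1}^{n-1}\Big[\Big(1 - \frac{D_n}{m_n(n-i)}\Big)^{+}\Big]^{n-i} + o(1) \to 0$$

as $n \to +\infty$; $a^+ = \max(a, 0)$. □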
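The limit $(1 - e^{-\log n - x})^{n/2 - k_n - l_n + 1} \to e^{-0.5e^{-x}}$ can be illustrated numerically. The sketch below assumes $l_n = 0$ and $k_n = \lfloor(\log n)^{1+\zeta}\rfloor$ with $\zeta = 0.5$; both choices, and the function name `product_limit`, are illustrative assumptions.

```python
import math


def product_limit(n, x, zeta=0.5):
    """(1 - exp(-log n - x)) ** (n/2 - k_n + 1), with l_n = 0 (assumption)."""
    k_n = int(math.log(n) ** (1.0 + zeta))
    exponent = n // 2 - k_n + 1
    return (1.0 - math.exp(-math.log(n) - x)) ** exponent
```

For $n = 10^6$ the value is already within about $10^{-4}$ of $e^{-0.5e^{-x}}$ at $x = 0$ and $x = 1$.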

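Lemma 3.4. Let $k_n \sim (\log n)^{1+\zeta}$, $0 < \zeta < 2$. As $n \to +\infty$:

(a) $\sup\{(n-i)(Z_{(i+1)} - Z_{(i)})|\bar Z|, i = \frac{n}{2} + l_n, \ldots, n-1\} \to 0$,

(b) $\sup\{(n-i)(Z_{(i+1)} - Z_{(i)})|\bar Z_{[i+1,n]} - E(\bar Z_{[i+1,n]} \mid Z_{(i)})|, i = \frac{n}{2} + l_n, \ldots, n - k_n\} \to 0$,

(c) $\sup\{(n-i)(Z_{(i+1)} - Z_{(i)})|\bar Z_{[i+1,n]} - E(\bar Z_{[i+1,n]} \mid Z_{(i)})|, i = n - k_n + 1, \ldots, n-1\} \to 0$,

all in probability.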

Proof. (a) (4) implies that $\frac{1 - \Phi(x)}{\phi(x)}$ is decreasing for
$x > 0$. Thus, it holds that

$$\sup\Big\{(n-i)(Z_{(i+1)} - Z_{(i)}), i = \frac{n}{2} + l_n, \ldots, n-1\Big\}|\bar Z|$$
$$\le \sup\Big\{(n-i)(Z_{(i+1)} - Z_{(i)})E(\bar Z_{[i+1,n]} \mid Z_{(i)}), i = \frac{n}{2} + l_n, \ldots, n-1\Big\}\,\frac{1 - \Phi(0)}{\phi(0)}\,|\bar Z| \to 0$$

in probability as $n \to +\infty$; use Lemma 3.3 and limit theorems for $\bar Z$.



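(b) Conditionally on $Z_{(i)}$, let

$$\sigma^2_{Z_{(i)}} = 1 + \frac{Z_{(i)}\phi(Z_{(i)})}{1 - \Phi(Z_{(i)})} - \Big[\frac{\phi(Z_{(i)})}{1 - \Phi(Z_{(i)})}\Big]^2 \le 1$$

denote the variance of the standard normal distribution truncated at
$Z_{(i)}$, $i = 0.5n + l_n, \ldots, n - k_n$. Then it holds that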



n 



sup<j (n- - -£(Z[i+i^„]|Z(i))|,i = - + /„,..., n - 

l-$(0) 



< 



sup<^ (n- - 



l%+l,n] - -^(%+l,n]l%))l J, 1 

X ,1 — — -\- In, ■ ■ ■ ,n Kn f ■ 



Let Si = E "=^+1 ^(i) - i5^(E"=i+i = 0.5n + 

5 > 0, it holds that 



_ ,^O.l l%+l,n] -^(%+l,n]l%)l ^ 



,n — kn- For 



(0 



^ •^0.4| ry 

\Jn — I 



< 



EP 



\S.\ 



< 



(Jz\J'n — i 
^P{\Z\>b{n-if-^\ 
2CiCj7 



>8{n-it^\Z^^^ = z 



P^Z\>b{n-i) 



0.4i 



+ C2 



,-0.5<52(n-i)°-' 

<5(n-i)0-4 



$C_U$ is the universal constant in the Berry-Esseen bound [see, e.g., Serfling
(1980) or Ibragimov and Linnik (1971)], $C_2$ is a positive constant. The
Markovian property of $Z_{(1)}, \ldots, Z_{(n)}$ implies that given
$Z_{(i)} = z$, the r.v.'s $Z_{(i+1)}, \ldots, Z_{(n)}$ form a sample from a
standard normal distribution truncated at $z$, therefore,



sup 



f £;(|Z(,-+i) -i? (Z(,-+i)|Z(,) = z)nZ(,) =z) 



3/2 



,z>0,j = i,...,n-l 



in the bound is replaced by its equivalent for the sample 



$C_1$ is bounded since:
(i) for $z > 0$ large, (4) implies $z \sim E(Z_1 \mid Z_1 > z)$ and

E[\Zi-EiZi\Zi>z)\^\Zi>z] 

3/2 

_ E[{{Zi - E{Zi\Zi > z)f)+\Zi > z] 

^ 3/2 
0-2 

_ E[{Zi - E{Zi\Zi > z)f\Zi > z] 

~ 3/2 ' 

where a"*" = max(a, 0), 

(ii) $\lim_{z \to +\infty} \frac{E[(Z_1 - E(Z_1 \mid Z_1 > z))^3 \mid Z_1 > z]}{\sigma_z^3} = 2$ [Nariaki and Akihide (1985)],

(iii) $\frac{E[|Z_1 - E(Z_1 \mid Z_1 > z)|^3 \mid Z_1 > z]}{\sigma_z^3}$ is a
continuous function in $z$ and, therefore, achieves its maximum in any
compact $[0, M]$, $M > 0$, and in particular for $M$ large.

Thus, as n — > +oo, it holds that 



i=0.5n+l„ 



_ .^O.l l^[»+l,n] -^(^[»+l,n]l%)l ^ ^ 



(0 



77. 

< E 

i=0.5n-|-i„ 



u:3 +'-'2 2^ 



(n — i)^ 



j=0.5n-|-(„ 
1 



(logn)0-3{i+C) (0.5n-Z„)0-3 
$C_3$ is a positive constant, $l_n = o(n)$. Let

0.ll%+l,n] - -E'(^[i+l,n]l^( 



(n-i)0-4 
^0; 



in-i 



<:6>, i = 0.57l + ln, ■ ■ ■ iTl— kn- 



Using relations in the proof of Lemma 3.3, it follows that 



J2 P \ {n-i){Z(^i+i) -Z(i)) 



i=0.bn+ln 



n—kn 

< E p 

i=0.bn+ln 



(n - i)(%+i) - > |(n - i) 

n— (logn)-'-+'> 



n—kn 



i=0.5n+(„ 



0.1 
= y Cly^e-^y] ° 7o"i i+c) + Cn / e""^ ^

" J(logn)0-i{l+C) lV(logn,)0.i(i+C) 

as n — > +oo; c = ^,ln = o{n),C^ is a constant, A; = 1, . . . , 11. 

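(c) From Deheuvels (1985), for $i = n - k_n + 1, \ldots, n-1$, it holds that

$$\sqrt{2\log n}\,(n-i)(Z_{(i+1)} - Z_{(i)}) \sim E_{n-i},$$

$E_j$, $j = 1, \ldots, k_n - 1$, are i.i.d. exponential with mean 1. From (4),
it also holds that

$$|\bar Z_{[i+1,n]} - E(\bar Z_{[i+1,n]} \mid Z_{(i)})| \le \frac{(Z_{(i+1)} - Z_{(i)}) + (Z_{(i+2)} - Z_{(i)}) + \cdots + (Z_{(n)} - Z_{(i)})}{n-i}$$
$$= ((n-i)(Z_{(i+1)} - Z_{(i)}) + (n-i-1)(Z_{(i+2)} - Z_{(i+1)}) + \cdots + (Z_{(n)} - Z_{(n-1)}))(n-i)^{-1}$$
$$\sim \frac{E_1 + \cdots + E_{n-i}}{\sqrt{2\log n}\,(n-i)} \le \frac{\sup\{E_j, j = 1, \ldots, n-i\}}{\sqrt{2\log n}} \le \frac{\sup\{E_j, j = 1, \ldots, k_n - 1\}}{\sqrt{2\log n}}.$$

Thus,

$$P[\sup\{(n-i)(Z_{(i+1)} - Z_{(i)})|\bar Z_{[i+1,n]} - E(\bar Z_{[i+1,n]} \mid Z_{(i)})|, i = n - k_n + 1, \ldots, n-1\} > \epsilon]$$
$$\sim P\Big[\sup\Big\{\frac{E_{n-i}}{\sqrt{2\log n}}\,\frac{\sup\{E_j, j = 1, \ldots, k_n - 1\}}{\sqrt{2\log n}}, i = n - k_n + 1, \ldots, n-1\Big\} > \epsilon\Big]$$
$$\le P[\sup\{E_j, j = 1, \ldots, k_n - 1\} > \sqrt{\epsilon}\sqrt{2\log n}] = 1 - (1 - e^{-\sqrt{\epsilon}\sqrt{2\log n}})^{k_n - 1} \to 0$$

as $n \to +\infty$. □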



Proof of Theorem 3.1. Conditionally on the value of $n_+$, from the
definition of $l_n$ (before Lemma 3.1) it holds that
$Z_{(n/2+l_n-1)} < 0 < Z_{(n/2+l_n)}$, and assume without loss of
generality that $\phi(Z_{(n/2+l_n-1)}) > \phi(Z_{(n/2+l_n)})$.
Note also that it holds

$$\frac{i(n-i)}{n^2}(\bar Z_{[i+1,n]} - \bar Z_{[1,i]}) = \frac{n-i}{n}(\bar Z_{[i+1,n]} - \bar Z) = \frac{i}{n}(\bar Z - \bar Z_{[1,i]}), \quad i = 1, \ldots, n-1.$$
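The identity $\frac{i(n-i)}{n^2}(\bar Z_{[i+1,n]} - \bar Z_{[1,i]}) = \frac{n-i}{n}(\bar Z_{[i+1,n]} - \bar Z) = \frac{i}{n}(\bar Z - \bar Z_{[1,i]})$ follows from $n\bar Z = i\bar Z_{[1,i]} + (n-i)\bar Z_{[i+1,n]}$ and can be checked numerically. The function name `three_forms` is hypothetical.

```python
def three_forms(sample, i):
    """Return the three equal expressions for a split after the i-th order statistic."""
    xs = sorted(sample)
    n = len(xs)
    mean_all = sum(xs) / n
    mean_low = sum(xs[:i]) / i            # average of the i smallest values
    mean_up = sum(xs[i:]) / (n - i)       # average of the n-i largest values
    a = i * (n - i) / n**2 * (mean_up - mean_low)
    b = (n - i) / n * (mean_up - mean_all)
    c = i / n * (mean_all - mean_low)
    return a, b, c
```

For the sample {0, 1, 3} and $i = 1$ all three expressions equal 4/9.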
Let $A_i$ be as in (13),
$B_i = \{i(Z_{(i+1)} - Z_{(i)})(-1)E(\bar Z_{[1,i]} \mid Z_{(i+1)}) \le x + \log n\}$,
$i = 1, \ldots, \frac{n}{2}$. From Lemma 3.3 and its proof, it follows
directly for the $A$'s, and by symmetry for the $B$'s, that as $n \to +\infty$,

$$P\Big[\bigcap_{i=k_n+1}^{n/2} B_i \cap \bigcap_{i=n/2+1}^{n-k_n} A_i\Big] \sim e^{-e^{-x}}$$

and

$$P\Big[\bigcup_{i=1}^{k_n} B_i' \cup \bigcup_{i=n-k_n+1}^{n-1} A_i'\Big] \to 0.$$

The proof is completed using Lemma 3.4. □